The JH4 intron within the immunoglobulin locus is subject to somatic hypermutation and therefore varies in sequence compared with the germline / reference.
This locus was amplified as a 433 bp amplicon and sequenced with 2x250 bp paired-end reads.
The goal is to align the reads with the reference sequence, quantify the number of mutations per amplicon, plot a distribution of mutational load within each sample, then compare the distribution between samples.
In analysis version 1, we took two approaches to characterise the mutations within the reads:
For the CrispRVariants approach, the package didn't seem well suited to the type of data here: the analysis presumes that reads entirely span the region under investigation, and it references mutations to a guide sequence / location. The notation it uses for mutations was also difficult to interpret.
For the Mutect2 approach - this gave the number of variants called per sample, but not the number of variants called per read.
In light of these shortcomings, we'll instead turn to a customised method for assessing the number of mutations per read in each sample.
library(readr)
library(ggplot2)
library(tidyverse)
library(data.table)
This is the same as in the first version of the analysis. We'll stitch the forward and reverse reads together into a single sequence for the amplicon. This step uses the command-line BBTools package, which includes a program called bbmerge.sh that overlaps read pairs into consensus reads.
#!/usr/bin/bash
module purge
module load Miniconda3/4.7.10
conda activate bbtools
bbmerge.sh in="${read1}.fastq.gz" in2="${read2}.fastq.gz" out="merged_reads/${sample}.fastq.gz"
This produced a folder containing the consensus reads for each sample
merged_reads/*.fastq
In this work, we're interested in how many mutations are present per read. To simplify the input, we can collapse the read sequences into sets of identical sequences and count how many reads are in each set. We can use a program called fastx_collapser (from the FASTX-Toolkit) for this
#!/usr/bin/bash
module purge
module load fastx_toolkit  # TODO: get the proper module name for this
fastx_collapser -v -Q 33 -i merged_reads/input.fastq -o collapsed/collapsed_output.fasta
This produces a folder with the collapsed read sequences. The fasta header for each sequence gives the rank order of that sequence (in terms of abundance) and the number of reads that share it
collapsed/*.fasta
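To make the downstream parsing concrete: fastx_collapser writes headers of the form ">rank-count", so both numbers can be pulled out with a simple split. A minimal R sketch (the header ">1-343" below is a made-up example in that style, not taken from the real data):

```r
## parse a fastx_collapser header of the form ">rank-count"
parse.collapsed.header <- function(h) {
  parts <- strsplit(sub('^>', '', h), '-')[[1]]
  list(rank = as.integer(parts[1]), count = as.integer(parts[2]))
}

parse.collapsed.header(">1-343")  # rank 1, found in 343 reads
```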
Following this, we can use MAFFT to run a multiple sequence alignment on the fasta files for each sample, using MAFFT's --add argument to align the reads against the reference
#!/usr/bin/bash
module purge
module load MAFFT  # TODO: get the proper module name for this
mafft --auto --add collapsed/input.fasta reference.fasta > mafft/aligned.aln
Here, we’ve produced a set of multifasta format alignments (one for each sample):
mafft/*.aln
Within each alignment, we have the aligned version of the reference. We can walk over each of the collapsed read sets and compare each position against it: a base-to-base mismatch counts as a substitution; a '-' in the reference opposite a base in the read counts as an insertion; a base in the reference opposite a '-' in the read counts as a deletion. We can also track each substitution individually to produce tables like those in the paper. I did this using a simple Perl script, but it could be done in R (probably quite a lot slower).
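As a sketch of that per-position logic in R (the real analysis used a Perl script over the MAFFT alignments; the two aligned sequences below are made-up examples, not real reads):

```r
## walk over two aligned, equal-length sequences and classify each mismatch
count.mutations <- function(ref, read) {
  ref.chars <- strsplit(tolower(ref), '')[[1]]
  read.chars <- strsplit(tolower(read), '')[[1]]
  stopifnot(length(ref.chars) == length(read.chars))  # both come from the same alignment
  snps <- insertions <- deletions <- 0
  for (i in seq_along(ref.chars)) {
    r <- ref.chars[i]
    q <- read.chars[i]
    if (r == q) next
    if (r == '-' && q != '-') {
      insertions <- insertions + 1   # gap in the reference, base in the read
    } else if (r != '-' && q == '-') {
      deletions <- deletions + 1     # base in the reference, gap in the read
    } else {
      snps <- snps + 1               # base-to-base substitution
    }
  }
  c(snps = snps, insertions = insertions, deletions = deletions)
}

count.mutations("acgt-acgt", "acctaac-t")  # snps 1, insertions 1, deletions 1
```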
This results in a file named mutation_counts.tsv. We can load that here and take a look at what it looks like (we'll just look at the first row because it's a massive table and takes a long time to show the whole thing)
working.dir <- '/Volumes/babs/working/goldstr2/projects/caladod/anqi.xu/DN19023-ax343/'
setwd(working.dir)
dd <- read.csv('mutation_counts.tsv', sep="\t")
dd <- data.table(dd)
knitr::kable(dd[1,])
| sample | rank | count | snps | deletions | insertions | total | length | sequence | conversions |
|---|---|---|---|---|---|---|---|---|---|
| DCKP93B_BM_5DPOG | 1 | 343 | 0 | 0 | 0 | 0 | 433 | -ccccactccactc-tttgtccctatgcatcggatactgtataaatgctgtcacagaggtggtcctgaagtatattccacaactaatacttttattctaaaaactgaaaatctccaactacagccccaactatccctccagccataggattgttttagcatcttcttcaaatgagcctccaaagtccctatcccatcatccagggactccaccaacaccatcacacagattcttagtttttcaaagatgtggagataatctgtcctaaaggctctgagatccctagacagtttatttcccaacttctctcagccggctccctcagggacaaatatccaagattagtctgcaatgctcagaaaactccataacaaaggttaaaaataaagacctggagagg---ccattcttacctgaggagacggtgactgaggttcc----- | A:C=0;A:G=0;A:T=0;C:A=0;C:G=0;C:T=0;G:A=0;G:C=0;G:T=0;T:A=0;T:C=0;T:G=0 |
The entry shown here is from sample DCKP93B_BM_5DPOG. It is the most abundant sequence in the merged fastq file (rank = 1) and was found in 343 reads in this sample. Relative to the reference, it has 0 snps, 0 insertions and 0 deletions; the total (edit) distance is 0, i.e. the sequence is identical to the reference. The aligned sequence is shown (sequence) alongside a map of the substitutions (conversions)
To explore any groups later on, we’ll add some columns to our dataset that give the group assignment
my.samplename.list <- strsplit(dd$sample, '_')
my.tissue <- vector(length = length(my.samplename.list))
my.time <- vector(length = length(my.samplename.list))
my.group <- vector(length = length(my.samplename.list))
for (i in seq_along(my.samplename.list)) {
  ## group is everything after the animal ID, e.g. "BM_5DPOG"
  my.group[i] <- paste(my.samplename.list[[i]][-1], collapse = "_")
  my.tissue[i] <- my.samplename.list[[i]][2]
  my.time[i] <- my.samplename.list[[i]][3]
}
dd$group <- my.group
dd$tissue <- my.tissue
dd$time <- my.time
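As an aside, data.table can do the same split without the loop. An equivalent sketch (not the code used above; it assumes dd as loaded earlier and that sample names are always ID_TISSUE_TIME):

```r
## loop-free equivalent using data.table::tstrsplit
dd[, c('tissue', 'time') := tstrsplit(sample, '_', keep = 2:3)]
dd[, group := paste(tissue, time, sep = '_')]
```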
To start, we'll take a look at the raw data to get a feel for what we're dealing with
First, we'll look at the distribution of sequence lengths. We're expecting the sequences to be ~433 bp long, as this is the length of the JH4 intron that was amplified; anything that deviates substantially from that is likely to be garbage
ggplot(dd, aes(x=sample, y=length)) +
geom_point(aes(colour=time, shape=tissue)) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
There look to be 3 different populations of sequences here - the on-target product at ~433 bp, a non-specific but common product at ~150 bp, and some trash at < 110 bp
It would be interesting to look at the gel from the PCR to check whether the ~150 bp product is visible. It's possible the PCR could be optimised / a gel extraction performed to purify the target product
We already know from the initial analysis that some of the samples have low read counts, but we can confirm it in the current data by plotting the sum of each ‘count’ per sample
ggplot(dd, aes(x=sample, y=count, fill=tissue)) +
geom_bar(stat="identity") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
Similar to before, some samples do have very low numbers of reads - this is likely to be problematic for estimating mutations, as we're likely to detect fewer mutations in these just by chance
ggplot(dd, aes(x=length, y=count)) +
geom_point(aes(colour=time, shape=tissue)) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
This is quite interesting, as we can see our target ~433 bp sequence is present across a range of counts, whereas the off-target sequences tend to be lower in abundance
We’ll apply these filters later
With Mutect2, we saw that samples that had less than 400 reads had pretty unreliable mutation counts - probably the result of PCR amplification errors from very low input samples
In our data, we've tracked the number of SNPs, the number of indels and the total number of mutations (we can call that the edit distance). Here we'll use SNPs rather than edit distance, since SNPs are probably a more reliable estimate of 'true' mutations
dd.1 <- dd[, list(sum.count = sum(count), mean.snps = mean(snps), tissue = unique(tissue), time = unique(time)), by = sample]
ggplot(dd.1, aes(x=mean.snps, y=sum.count)) +
geom_point(aes(colour=time, shape=tissue)) +
geom_hline(aes(yintercept=400), color="blue", linetype="dashed")
Similar to our Mutect2 data, samples that have less than 400 reads may be unreliable in the number of mutations that are called
Let's take a look at which samples are affected - unlike the previously suggested filter, which removes sets from within samples, here we'll be losing entire samples
too.few.reads <- dd.1$sample[which(dd.1$sum.count < 400)]
print(too.few.reads)
## [1] "DCKP92E_BM_5MPOG" "DCKP92f_BM_5MPOG" "DCKP92h_BM_5MPOG"
## [4] "DCKP92f_SP_5MPOG" "DCKP92g_SP_5MPOG" "DCKP92h_SP_5MPOG"
## [7] "DCKP103D_SP_9MPOG" "DCKP106F_SP_9MPOG" "DCKP106E_BM_9MPOG"
At this point I think it's worthwhile to start applying the filters. I think we're reasonably sure that the sequences < 150 bp are not useful (this can also be examined in the aligned sequences in dd)
Reluctantly, I think it’s also worthwhile ditching the samples that have less than 400 reads. I don’t think that, on balance, we can have much faith in the data from these
dd.filtered <- dd[which(dd$length > 400),] ## take only the sequences that are > 400 bp
dd.filtered <- dd.filtered[!(dd.filtered$sample %in% too.few.reads),] ## remove samples with fewer than 400 reads
With those filters applied, we’ll take a look at the total edit distance for a set. I suspect there’s still some contaminating sequences here and that should be evident from having sets that are still wildly different from the reference
ggplot(dd.filtered, aes(x=length, y=total)) +
geom_point(aes(colour=time, shape=tissue))
Here, we can see that a few reads have edit distances that are very large (> 200). There aren't too many, so we can have a look at these more closely
knitr::kable(dd.filtered[which(dd.filtered$total > 100),])
| sample | rank | count | snps | deletions | insertions | total | length | sequence | conversions | group | tissue | time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DCKP93C_BM_2MPOG | 743 | 1 | 177 | 69 | 48 | 294 | 412 | —–acgctgactgaccggtgtggctctttaacaacctttgcttgtcccgataggtcacctttggctcttcagagatgcagggacacacgatgggcttctggttaaacaaactgaattatttgtgccat–ctctcaatgttgacggacagcctatttttgccaatatcacactgccaggtactga–cgtttta—-ctttttaaaaaga—–taaggttgttgtggtaagtacaggatagaccacttgaaacattaaacccagttctcaattttt-gcctgatgtcagg——————–cacgttatccaagctgtttgtatcctattctcttccat————–aaataaaatggaagtgatgtatttgtacgttatgtgttaaaggtgttatagtg—————–tctcaagagc—actttgggctcttaagagacaaggagc-agtcgatc—– | A:C=6;A:G=22;A:T=21;C:A=14;C:G=16;C:T=29;G:A=17;G:C=6;G:T=6;T:A=11;T:C=18;T:G=11 | BM_2MPOG | BM | 2MPOG |
Based on that (number of SNPs, number of indels, heterogeneity of sample source) I think it’s valid to filter those also
dd.filtered <- dd.filtered[which(dd.filtered$total < 100),]
We might expect 'true' mutations to be called in multiple reads in a sample. If a set is represented by only 1 read, we might suspect sequencing error or some other source of noise (of course, these could also be genuinely low-frequency mutations, but they would then be inseparable from the noise)
We can take a look at whether the number of reads per set interacts at all with the number of called mutations in that set. As we're interested in the lower counts, we'll set the x-axis limits of the plot so we can zoom into this area
ggplot(dd.filtered, aes(x=count, y=total)) +
geom_point(aes(colour=time, shape=tissue)) +
xlim(c(0,20))
We can see here that, yes, sets that are represented by only 1 read have a wide range of edit distances, including upwards of 35 mutations which is rarely seen at higher coverage
It’s probably necessary to settle on some sort of filter here. Very liberally, we can take a depth of 2x to say it’s a true set
dd.filtered <- dd.filtered[which(dd.filtered$count >= 2),] ## keep only sets represented by at least 2 reads
After we've filtered our sequences down to what we hope are robust sets and samples - what has that done to our dataset?
# we can look at the number of rows in the table to see how many sets we're left with:
dim(dd.filtered)[1]
## [1] 5269
# Or look at that as a percent of the total number of sets we started with:
dim(dd.filtered)[1] / dim(dd)[1]*100
## [1] 11.10128
It's perhaps not fantastic news - we've lost around 90% of our original sets - we're left with 5269 sets across the samples
For interest's sake, we can look at how many sets each sample now contains after filtering:
dd.filtered.1 <- dd.filtered[, list(length(count), unique(tissue), unique(time)), by=list(sample)]
colnames(dd.filtered.1) <- c('sample', 'num.sets', 'tissue', 'time')
ggplot(dd.filtered.1, aes(x=sample, y=num.sets, fill=tissue)) +
geom_bar(stat="identity") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
So, it doesn't look very promising here - I suspect that we'll see a strong relationship between the number of sets per sample and the total number of reads
dd.filtered.1 <- dd.filtered[, list(length(count), sum(count), unique(tissue), unique(time)), by=list(sample)]
colnames(dd.filtered.1) <- c('sample', 'num.sets', 'num.reads', 'tissue', 'time')
ggplot(dd.filtered.1, aes(x=num.sets, y=num.reads)) +
geom_point(aes(colour=time, shape=tissue))
Ultimately, the diversity of mutations that we can find is extremely sensitive to the number of reads we've got - I think we need more reads per sample to really make an assessment of this.
Bearing in mind the shortcomings of the data, we can still check whether it shows any patterns that make biological sense. I imagine we're expecting to see mutations accrue over time
We’ll look at SNPs rather than edit distance. I think that we’ve probably filtered out most of the indels now anyway
dd.filtered.groups <- dd.filtered[,list(mean(snps), unique(tissue), unique(time)), by=list(group)]
colnames(dd.filtered.groups) <- c('group', 'mean.snps', 'tissue', 'time')
dd.filtered.groups$group <- factor(dd.filtered.groups$group, levels = c('BM_5DPOG', 'BM_2MPOG', 'BM_4MPOG', 'BM_5MPOG', 'BM_9MPOG', 'SP_5DPOG', 'SP_2MPOG', 'SP_4MPOG', 'SP_5MPOG', 'SP_9MPOG', 'SP_NEGCTRL', 'SP_POSCTRL'))
ggplot(dd.filtered.groups, aes(x=group, y=mean.snps, fill=tissue)) +
geom_bar(stat="identity") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
If this isn't what was expected (I suspect it isn't), then I think this is probably due to the quality of the underlying dataset
The paper at https://www.jimmunol.org/content/jimmunol/early/2019/08/09/jimmunol.1900483.full.pdf had some pie charts showing, for each sample, the proportion of reads with 1, 2, 3, 4, 5, 6 and 7+ mutations. Pie charts are a pretty poor way to represent data (it's quite difficult to estimate area accurately in a pie chart; this would probably be better shown as a proportional stacked bar graph)
First we'll initialise a data.frame to store results, then count how many sets fall into each SNP-count category within each group
my.results <- data.frame('group' = unique(dd.filtered$group), 'zero' = 0, 'one' = 0, 'two' = 0, 'three' = 0, 'four' = 0, 'five' = 0, 'six' = 0, 'seven.plus' = 0)
for (i in seq_len(nrow(my.results))) {
  my.group <- dd.filtered$snps[which(dd.filtered$group == my.results$group[i])]
  my.results$zero[i] <- length(which(my.group == 0))
  my.results$one[i] <- length(which(my.group == 1))
  my.results$two[i] <- length(which(my.group == 2))
  my.results$three[i] <- length(which(my.group == 3))
  my.results$four[i] <- length(which(my.group == 4))
  my.results$five[i] <- length(which(my.group == 5))
  my.results$six[i] <- length(which(my.group == 6))
  my.results$seven.plus[i] <- length(which(my.group >= 7))
}
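As an aside, the proportional stacked bar alternative suggested above can be drawn straight from this table. A sketch (it assumes my.results as built here):

```r
## stacked proportional bars: one bar per group, segments = mutation-count bins
my.melt <- reshape2::melt(my.results, id.vars = "group")
ggplot(my.melt, aes(x = group, y = value, fill = variable)) +
  geom_bar(stat = "identity", position = "fill") +  # "fill" rescales each bar to proportions
  ylab("proportion of sets") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
```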
Next we'll set up a small function to plot a pie chart, so we don't have to type it out over and over again, and use it to plot the pies for each group
plot.pie <- function(x) {
  ## melt the single-row data.frame into (variable, value) pairs for plotting
  my.melt <- reshape2::melt(x, id.vars = "group")
  ggplot(my.melt, aes(x = "", y = value, fill = variable)) +
    geom_bar(width = 1, stat = "identity") +
    coord_polar("y", start = 0) +
    theme(axis.text.x = element_blank()) +
    ggtitle(x$group)
}
for (i in seq_len(nrow(my.results))) {
  print(plot.pie(rev(my.results[i,])))
}
We want to set up a function that parses the substitution tables out of the data. Making it a function means we can easily run it multiple times if we need to; we'll then use it to produce the tables of substitutions for each sample. Here we're using the filtered data, but the function makes it easy to rerun on the pre-filtered data if we wanted to
get.conversions <- function(x) {
  conversions <- strsplit(x$conversions, ';')
  conv.tables <- list()
  for (i in seq_along(unique(x$sample))) {
    ## 4x4 substitution matrix: rows = reference base, cols = read base
    result.table <- matrix(0, nrow = 4, ncol = 4)
    colnames(result.table) <- c('A', 'T', 'C', 'G')
    rownames(result.table) <- c('A', 'T', 'C', 'G')
    sample.conversions <- conversions[which(x$sample == unique(x$sample)[i])]
    for (j in seq_along(sample.conversions)) {
      for (k in seq_along(sample.conversions[[j]])) {
        ## each entry looks like "A:G=22" - split into the conversion and its count
        conv.val <- unlist(strsplit(sample.conversions[[j]][k], '='))
        conv <- unlist(strsplit(conv.val[1], ':'))
        result.table[conv[1], conv[2]] <- result.table[conv[1], conv[2]] + as.numeric(conv.val[2])
      }
    }
    conv.tables[[i]] <- result.table
  }
  names(conv.tables) <- unique(x$sample)
  return(conv.tables)
}
## Given the questions over the quality of the data, I'm not sure they will be particularly helpful, but we can create them so:
conversion.tables <- get.conversions(dd.filtered)
print(conversion.tables)
## $DCKP93B_BM_5DPOG
## A T C G
## A 0 79 118 866
## T 79 0 1054 134
## C 135 394 0 64
## G 303 111 163 0
##
## $DCKP103B_BM_5DPOG
## A T C G
## A 0 127 191 2054
## T 97 0 2182 94
## C 128 268 0 8
## G 219 150 31 0
##
## $DCKP141F_BM_5DPOG
## A T C G
## A 0 433 837 1518
## T 396 0 1606 533
## C 133 745 0 83
## G 536 151 23 0
##
## $DCKP141G_BM_5DPOG
## A T C G
## A 0 191 247 990
## T 666 0 1411 662
## C 673 204 0 130
## G 360 662 98 0
##
## $DCKP106C_BM_5DPOG
## A T C G
## A 0 698 300 1180
## T 1594 0 1384 280
## C 812 1481 0 31
## G 1125 157 300 0
##
## $DCKP141A_BM_5DPOG
## A T C G
## A 0 207 291 1425
## T 348 0 1420 360
## C 159 846 0 272
## G 476 342 200 0
##
## $DCKP141F_SP_5DPOG
## A T C G
## A 0 230 196 997
## T 698 0 2208 144
## C 208 312 0 280
## G 264 203 18 0
##
## $DCKP141G_SP_5DPOG
## A T C G
## A 0 156 131 689
## T 170 0 775 84
## C 188 273 0 47
## G 197 169 73 0
##
## $DCKP106C_SP_5DPOG
## A T C G
## A 0 139 80 886
## T 72 0 1268 79
## C 178 156 0 14
## G 403 115 24 0
##
## $DCKP141A_SP_5DPOG
## A T C G
## A 0 392 188 989
## T 833 0 1526 674
## C 313 922 0 244
## G 459 188 412 0
##
## $DCKP93C_BM_2MPOG
## A T C G
## A 0 356 584 1557
## T 350 0 1782 466
## C 274 999 0 167
## G 801 401 181 0
##
## $DCKP93D_BM_2MPOG
## A T C G
## A 0 147 208 1283
## T 287 0 1178 288
## C 536 964 0 166
## G 406 274 59 0
##
## $DCKP141E_BM_2MPOG
## A T C G
## A 0 74 165 880
## T 97 0 1045 106
## C 216 254 0 19
## G 169 196 42 0
##
## $DCKP141D_BM_2MPOG
## A T C G
## A 0 118 194 1216
## T 440 0 1363 397
## C 251 362 0 56
## G 276 242 43 0
##
## $DCKP141B_BM_2MPOG
## A T C G
## A 0 71 210 669
## T 85 0 641 90
## C 190 193 0 24
## G 131 211 29 0
##
## $DCKP141E_SP_2MPOG
## A T C G
## A 0 126 154 1281
## T 177 0 1253 221
## C 334 307 0 123
## G 527 183 43 0
##
## $DCKP141D_SP_2MPOG
## A T C G
## A 0 190 285 1255
## T 173 0 1593 335
## C 261 635 0 30
## G 550 237 39 0
##
## $DCKP141C_SP_2MPOG
## A T C G
## A 0 121 208 1130
## T 120 0 890 107
## C 133 290 0 36
## G 193 179 1009 0
##
## $DCKP141B_SP_2MPOG
## A T C G
## A 0 91 200 983
## T 1341 0 883 112
## C 177 1446 0 18
## G 1383 177 28 0
##
## $DCKP151B_BM_4MPOG
## A T C G
## A 0 71 194 722
## T 171 0 767 131
## C 184 226 0 31
## G 208 158 35 0
##
## $DCKP141I_BM_4MPOG
## A T C G
## A 0 77 178 942
## T 135 0 955 136
## C 183 271 0 55
## G 161 204 21 0
##
## $DCKP141H_BM_4MPOG
## A T C G
## A 0 46 115 799
## T 936 0 704 106
## C 155 179 0 13
## G 122 106 23 0
##
## $DCKP151B_SP_4MPOG
## A T C G
## A 0 70 134 699
## T 83 0 761 62
## C 173 183 0 9
## G 137 108 19 0
##
## $DCKP141I_SP_4MPOG
## A T C G
## A 0 47 112 505
## T 63 0 566 63
## C 107 139 0 16
## G 106 101 25 0
##
## $DCKP141H_SP_4MPOG
## A T C G
## A 0 80 180 785
## T 465 0 1277 831
## C 191 229 0 21
## G 195 280 20 0
##
## $DCKP92g_BM_5MPOG
## A T C G
## A 0 77 96 822
## T 152 0 694 168
## C 214 221 0 18
## G 222 221 37 0
##
## $DCKP92e_SP_5MPOG
## A T C G
## A 0 2 4 10
## T 5 0 23 5
## C 8 8 0 1
## G 8 4 6 0
##
## $DCKP103A_SP_9MPOG
## A T C G
## A 0 7 29 190
## T 16 0 265 11
## C 28 175 0 3
## G 22 29 7 0
##
## $DCKP103C_SP_9MPOG
## A T C G
## A 0 42 121 475
## T 48 0 498 72
## C 86 107 0 18
## G 93 101 19 0
##
## $DCKP106E_SP_9MPOG
## A T C G
## A 0 0 1 1
## T 1 0 6 1
## C 3 0 0 0
## G 2 1 0 0
##
## $DCKP151C_SP_9MPOG
## A T C G
## A 0 16 29 198
## T 20 0 293 16
## C 39 56 0 5
## G 35 35 8 0
##
## $DCKP106F_BM_9MPOG
## A T C G
## A 0 796 917 1119
## T 117 0 1017 107
## C 142 258 0 15
## G 184 188 733 0
##
## $DCKP151C_BM_9MPOG
## A T C G
## A 0 0 0 0
## T 0 0 1 0
## C 0 1 0 0
## G 0 0 0 0
##
## $DCKP251A_SP_POSCTRL
## A T C G
## A 0 40 102 611
## T 83 0 832 70
## C 128 166 0 20
## G 113 113 21 0
##
## $DCKP251B_SP_POSCTRL
## A T C G
## A 0 67 134 851
## T 1648 0 2273 78
## C 184 1224 0 13
## G 140 146 22 0
##
## $DCKP251C_SP_POSCTRL
## A T C G
## A 0 13 18 156
## T 18 0 163 15
## C 25 41 0 3
## G 35 27 3 0
##
## $DCKP212E_SP_NEGCTRL
## A T C G
## A 0 52 112 523
## T 52 0 697 73
## C 122 145 0 8
## G 118 86 18 0
##
## $DCKP223H_SP_NEGCTRL
## A T C G
## A 0 18 107 728
## T 32 0 301 28
## C 73 78 0 5
## G 46 92 8 0
##
## $DCKP223I_SP_NEGCTRL
## A T C G
## A 0 48 67 514
## T 42 0 776 68
## C 123 128 0 6
## G 86 100 15 0
# just doing a density plot as I was interested to see it
p <- ggplot(dd.filtered, aes(x=snps, color=group)) +
geom_density() +
xlim(0, 20)
p
## we can write two different tables, the table that underlies the pie charts and the raw (filtered) data that you can then manipulate as you like:
my.results.fn <- paste(working.dir,'r_output_files/','pie_chart_data.csv', sep = "")
dd.filtered.fn <- paste(working.dir,'r_output_files/','raw_filtered_data.csv', sep = "")
write.csv(my.results, my.results.fn)
write.csv(dd.filtered, dd.filtered.fn)